Ternary Tree Optimalization for n-gram Indexing
Abstract
N-gram indexing is used in many practical applications, such as spam detection, plagiarism detection, and comparison of DNA reads. Many data structures can be used for this purpose, each with different characteristics. This article uses the ternary search tree. It first introduces an improvement to the ternary tree that can save up to 43% of the required memory. In the second part, a new data structure, named the ternary forest, is proposed. The efficiency of the ternary forest is tested and compared with that of the ternary search tree and the two-level indexing ternary search tree.
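For readers unfamiliar with the baseline structure, the following is a minimal sketch of a plain ternary search tree used as an n-gram index. It is not the memory-saving improvement or the ternary forest described in the abstract, neither of which is specified here. The names `Node`, `TernarySearchTree`, and `index_ngrams` are hypothetical, and indexing character n-grams (rather than word n-grams) is an assumption made to keep the example short.

```python
from typing import Optional


class Node:
    """One ternary search tree node: a split character plus three children."""
    __slots__ = ("ch", "lo", "eq", "hi", "count")

    def __init__(self, ch: str) -> None:
        self.ch = ch
        self.lo: Optional["Node"] = None  # keys whose current char is smaller
        self.eq: Optional["Node"] = None  # keys that continue past this char
        self.hi: Optional["Node"] = None  # keys whose current char is larger
        self.count = 0  # how many times an n-gram ends at this node


class TernarySearchTree:
    def __init__(self) -> None:
        self.root: Optional[Node] = None

    def add(self, key: str) -> None:
        """Insert key, or increment its counter if it is already stored."""
        if not key:
            return
        if self.root is None:
            self.root = Node(key[0])
        node, i = self.root, 0
        while True:
            ch = key[i]
            if ch < node.ch:
                if node.lo is None:
                    node.lo = Node(ch)
                node = node.lo
            elif ch > node.ch:
                if node.hi is None:
                    node.hi = Node(ch)
                node = node.hi
            elif i == len(key) - 1:
                node.count += 1  # last character matched: key ends here
                return
            else:
                i += 1
                if node.eq is None:
                    node.eq = Node(key[i])
                node = node.eq

    def count(self, key: str) -> int:
        """Return how often key was added (0 if it was never added)."""
        node, i = self.root, 0
        while node is not None and key:
            ch = key[i]
            if ch < node.ch:
                node = node.lo
            elif ch > node.ch:
                node = node.hi
            elif i == len(key) - 1:
                return node.count
            else:
                i += 1
                node = node.eq
        return 0


def index_ngrams(text: str, n: int) -> TernarySearchTree:
    """Index every character n-gram of text (assumption: character-level)."""
    tree = TernarySearchTree()
    for start in range(len(text) - n + 1):
        tree.add(text[start:start + n])
    return tree
```

For example, `index_ngrams("abracadabra", 3).count("abr")` counts how often the trigram "abr" occurs in the text. Each node stores three child references and a counter, which suggests why per-node memory is a natural target for optimization in this kind of index.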
Similar resources
Efficient In-memory Data Structures for n-grams Indexing
Indexing n-gram phrases from text has many practical applications, such as plagiarism detection, comparison of DNA sequences, or spam detection. In this paper we describe several data structures, such as the hash table or the B+ tree, that could store n-grams for searching. We perform tests that show their advantages and disadvantages. One neglected data structure for this purpose, the ternary search tree, is deep...
A succinct data structure for self-indexing ternary relations
The representation of binary relations has been intensively studied and many different theoretical and practical representations have been proposed to answer the usual queries in multiple domains. However, ternary relations have not received as much attention, even though many real-world applications require the processing of ternary relations. In this paper we present a new compressed and self...
The Treegram Index: An Efficient Technique for Retrieval in Linguistic Treebanks
In computational linguistics, large tree databases tagged with morpho-syntactic information are in need of fast retrieval of multiway tree structures. To tackle this problem, we present a generalization of the classical n-gram indexing technique called Treegram indexing. As an application of treegram indexing, we describe the Venona retrieval system, which handles the BHt treebank containing 5...
Multiway-Tree Retrieval Based on Treegrams
Large tree databases as knowledge repositories are becoming more and more important; prominent examples are the treebanks in computational linguistics: text corpora consisting of up to five million words tagged with syntactic information. Consequently, these large amounts of structured data pose the problem of fast tree retrieval: given a database T of labeled multiway trees and a query tree q, find...